ggml-opencl: add opt-in Adreno xmem F16xF32 GEMM for prefill by happyyzy · Pull Request #22755 · ggml-org/llama.cpp

happyyzy · 2026-05-06T11:42:01Z

Summary

This PR adds an opt-in Adreno xmem GEMM path for OpenCL prefill matmul.

Scope:

- build-time gated by GGML_OPENCL_USE_ADRENO_KERNELS
- runtime opt-in via GGML_OPENCL_ADRENO_XMEM_GEMM=1
- limited to Adreno A8X
- limited to F16 x F32 -> F32 GGML_OP_MUL_MAT
- limited to contiguous, single-batch GEMM shapes
- requires N > 1, so token-generation / GEMV decode is not routed through this path
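
The type and shape conditions in the scope list can be read as a single predicate. The following is a minimal sketch, not the PR's code: `DType`, `MatMulDesc`, and `xmem_gemm_eligible` are hypothetical names introduced for illustration, and the real backend inspects ggml tensors rather than a descriptor struct.

```cpp
#include <cstdint>

// Hypothetical stand-ins for illustration only; these are not ggml's real types.
enum class DType { F16, F32, Other };

struct MatMulDesc {
    DType        weight_type;      // weights operand
    DType        activation_type;  // activations operand
    DType        output_type;      // result
    bool         contiguous;       // all operands are contiguous in memory
    std::int64_t batch;            // product of the batch dimensions
    std::int64_t n;                // number of activation columns (tokens)
};

// Returns true when the types and shape match the scope listed above.
bool xmem_gemm_eligible(const MatMulDesc & d) {
    return d.weight_type     == DType::F16 &&
           d.activation_type == DType::F32 &&
           d.output_type     == DType::F32 &&
           d.contiguous &&
           d.batch == 1 &&
           d.n > 1;  // N == 1 is GEMV decode and stays on the existing path
}
```

With this sketch, a 512-token prefill matmul is eligible, while the same matmul with `n = 1` (one decoded token) is not.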

The implementation keeps the existing ggml tensor layout externally and uses a small bridge around the xmem GEMM:

1. pack F32 activations into a half image
2. pack F16 weights into the xmem kernel layout
3. run the Adreno xmem OS8 GEMM
4. store the half image result back to F32 output
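
One consequence of the four bridge steps is that both the activations and the result pass through half precision. The CPU reference below is a rough sketch of that data flow under stated assumptions: row-major layouts, F32 accumulation inside the GEMM, and a rounding helper that models only half's 11-bit significand. `round_to_half_precision` and `bridge_reference` are hypothetical names, and none of this reflects the actual kernel layout or accumulation strategy.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Rounds x to 11 significand bits, the precision of an IEEE 754 half.
// Simplification: half's limited exponent range (overflow, subnormals) is ignored.
float round_to_half_precision(float x) {
    if (x == 0.0f || !std::isfinite(x)) {
        return x;
    }
    int exp = 0;
    const float mant = std::frexp(x, &exp);                   // x = mant * 2^exp, 0.5 <= |mant| < 1
    const float kept = std::nearbyint(std::ldexp(mant, 11));  // nearest, ties to even by default
    return std::ldexp(kept, exp - 11);
}

// CPU reference for the four bridge steps.
// weights:     M x K, row-major, values assumed already representable in F16
// activations: N x K, row-major, F32
// returns:     N x M, row-major, F32
std::vector<float> bridge_reference(const std::vector<float> & weights,
                                    const std::vector<float> & activations,
                                    std::size_t M, std::size_t N, std::size_t K) {
    // Step 1: pack F32 activations into half precision.
    std::vector<float> act_half(activations.size());
    for (std::size_t i = 0; i < activations.size(); ++i) {
        act_half[i] = round_to_half_precision(activations[i]);
    }
    // Step 2: weight packing only changes layout on the GPU, so it is a no-op here.
    std::vector<float> out(N * M);
    for (std::size_t n = 0; n < N; ++n) {
        for (std::size_t m = 0; m < M; ++m) {
            // Step 3: GEMM; F32 accumulation is an assumption of this sketch.
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                acc += weights[m * K + k] * act_half[n * K + k];
            }
            // Step 4: the result passes through a half image before the F32 store.
            out[n * M + m] = round_to_half_precision(acc);
        }
    }
    return out;
}
```

A reference like this could be used to estimate how much error the half-precision round trip adds compared with a pure F32 matmul on the same inputs.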

The generic OpenCL matmul path remains unchanged unless the new runtime opt-in is set.

Results

Tested on Adreno 830 with OpenCL:

OpenCL driver: OpenCL 3.0 QUALCOMM build: 0800.71 Compiler E031.47.18.49
build: 09294365a (468)

Qwen2.5 1.5B F16

Before, baseline OpenCL:

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| qwen2 1.5B F16                 |   2.88 GiB |     1.54 B | OpenCL     |  99 |           pp512 |        204.98 ± 0.61 |
| qwen2 1.5B F16                 |   2.88 GiB |     1.54 B | OpenCL     |  99 |           tg128 |         18.84 ± 0.07 |

After, with GGML_OPENCL_ADRENO_XMEM_GEMM=1:

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| qwen2 1.5B F16                 |   2.88 GiB |     1.54 B | OpenCL     |  99 |           pp512 |        356.19 ± 0.31 |
| qwen2 1.5B F16                 |   2.88 GiB |     1.54 B | OpenCL     |  99 |           tg128 |         18.98 ± 0.06 |

Prefill improved from 204.98 tok/s to 356.19 tok/s, about 1.74x.

Qwen2.5 3B F16

Before, baseline OpenCL:

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| qwen2 3B F16                   |   5.75 GiB |     3.09 B | OpenCL     |  99 |           pp512 |        101.26 ± 0.04 |
| qwen2 3B F16                   |   5.75 GiB |     3.09 B | OpenCL     |  99 |           tg128 |          9.53 ± 0.04 |

After, with GGML_OPENCL_ADRENO_XMEM_GEMM=1:

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| qwen2 3B F16                   |   5.75 GiB |     3.09 B | OpenCL     |  99 |           pp512 |        163.90 ± 2.84 |
| qwen2 3B F16                   |   5.75 GiB |     3.09 B | OpenCL     |  99 |           tg128 |          9.51 ± 0.11 |

Prefill improved from 101.26 tok/s to 163.90 tok/s, about 1.62x.

Decode is intentionally unchanged. Decode-only profiling confirmed that token generation stays on the existing OpenCL path (adreno_xmem count = 0).

Correctness

Checked end-to-end generation with the xmem path enabled on Qwen2.5 1.5B F16 and Qwen2.5 3B F16. Both models produced normal decode output.

Notes

This path depends on Qualcomm Adreno OpenCL subgroup constant-load extensions and is therefore guarded behind the existing Adreno kernel build option plus an explicit runtime environment variable.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES - AI was used as an assistant for code review, testing coordination, and drafting text from user-provided results. The contributor is responsible for the submitted changes.

ggml-gh-bot · 2026-05-06T11:46:16Z

Hi @happyyzy, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 2 open PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

happyyzy · 2026-05-06T13:28:35Z

Thanks for the note. I closed the other open PR (#22117) and will focus on this smaller xmem GEMM PR first.

lhez · 2026-05-07T17:55:53Z

Thank you - this is much easier. Will take a closer look in the next few days.

lhez · 2026-05-08T04:29:34Z

I was able to reproduce your results on A840 with Qwen3-1.7B-f16. My device has

ggml_opencl: device: 'QUALCOMM Adreno(TM) 840 (OpenCL 3.0 Adreno(TM) 840)'
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: 0842.27.1 Compiler E031.50.19.18

With GGML_OPENCL_ADRENO_XMEM_GEMM=1,

ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
ggml_opencl: Adreno xmem F16xF32 GEMM enabled (temporary weight prepack)
ggml_opencl: loading OpenCL kernels......................................................................................................................
ggml_opencl: default device: 'QUALCOMM Adreno(TM) 840 (OpenCL 3.0 Adreno(TM) 840)'

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| qwen3 1.7B F16                 |   3.78 GiB |     2.03 B | OpenCL     |  99 |           pp512 |       394.43 ± 24.45 |
| qwen3 1.7B F16                 |   3.78 GiB |     2.03 B | OpenCL     |  99 |           tg128 |         13.60 ± 0.08 |

build: f79069d9d (9050)

Without GGML_OPENCL_ADRENO_XMEM_GEMM,

ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
ggml_opencl: loading OpenCL kernels......................................................................................................................
ggml_opencl: default device: 'QUALCOMM Adreno(TM) 840 (OpenCL 3.0 Adreno(TM) 840)'

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| qwen3 1.7B F16                 |   3.78 GiB |     2.03 B | OpenCL     |  99 |           pp512 |        169.50 ± 0.25 |
| qwen3 1.7B F16                 |   3.78 GiB |     2.03 B | OpenCL     |  99 |           tg128 |         13.55 ± 0.04 |

build: f79069d9d (9050)

lhez · 2026-05-08T05:28:11Z

    cl_program program_mul_mv_f32_f32;
    cl_program program_mul;
    cl_program program_mul_mat_f16_f32_tiled;
+    cl_program program_adreno_xmem_gemm_f16_f32;


Let's remove this cl_program object. Instead, use a local cl_program in load_cl_kernels and release it when done. Something like this,

    cl_program prog =
        build_program_from_source(backend_ctx->context, backend_ctx->device, kernel_src.c_str(), CL_moe_compile_opts);

    CL_CHECK((backend_ctx->kernel_gemv_moe_q4_0_f32_ns = clCreateKernel(prog, "kernel_gemv_moe_q4_0_f32_ns", &err), err));
    CL_CHECK(clReleaseProgram(prog));
    GGML_LOG_CONT(".");

There is no need to keep the cl_program objects and we plan to remove them.

lhez · 2026-05-08T06:11:04Z

+#ifdef GGML_OPENCL_USE_ADRENO_KERNELS
+    backend_ctx->adreno_xmem_gemm_enabled = getenv("GGML_OPENCL_ADRENO_XMEM_GEMM") != nullptr &&
+                                             backend_ctx->gpu_family == GPU_FAMILY::ADRENO &&
+                                             backend_ctx->adreno_gen == ADRENO_GPU_GEN::A8X;


backend_ctx->gpu_family == GPU_FAMILY::ADRENO is enough and you don't need to check for A8x.

I think you can safely assume the two extensions for xmem always exist on modern Adreno GPUs (I think they even go back to A6x). In case they are not supported, the kernel compilation will fail.

happyyzy · 2026-05-08T12:15:54Z

Thanks, addressed both comments: the xmem program is now local to load_cl_kernels() and released after kernel creation, and the runtime gate now checks Adreno family instead of A8x only.
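
For readers following the thread, the revised gate might look roughly like the sketch below. It assumes the presence-only `getenv` test from the quoted diff was kept and only the A8X generation check was dropped; the actual follow-up commit is not quoted here. `GpuFamily` and `adreno_xmem_gemm_enabled` are hypothetical stand-ins for the backend's own enum and context field.

```cpp
#include <cstdlib>

// Hypothetical stand-in for the backend's GPU family enum.
enum class GpuFamily { Adreno, Intel, Unknown };

// Sketch of the gate after review: the environment variable is set and the
// device is in the Adreno family. Only presence is tested, so under this
// assumption any value (even "0") enables the path.
bool adreno_xmem_gemm_enabled(GpuFamily family) {
    return std::getenv("GGML_OPENCL_ADRENO_XMEM_GEMM") != nullptr &&
           family == GpuFamily::Adreno;
}
```

Because the check is for presence rather than value, disabling the path under this sketch means unsetting the variable rather than setting it to 0.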

ggml-opencl: add Adreno xmem F16xF32 GEMM for prefill

c5e0577

happyyzy requested a review from a team as a code owner May 6, 2026 11:42

github-actions Bot added the ggml and OpenCL labels May 6, 2026

happyyzy mentioned this pull request May 6, 2026

ggml-opencl: add Adreno xmem attention path #22117

Closed

lhez reviewed May 8, 2026


ggml-opencl: address Adreno xmem review comments

a7e7032

happyyzy wants to merge 2 commits into ggml-org:master from happyyzy:adreno-xmem-gemm-prefill
